ABSTRACT


Associated with each learning system there is a class of learnable behaviors. If the target
behavior to be acquired is in the learnable class, it will be learned perfectly. If it is outside that class,
the machine will only be able to acquire a behavior that approximates the target and it will always
make errors. It is desirable for a learning machine to have a large learnable class to maximize the
chances of acquiring the unknown behavior and to minimize the expected error when only an
approximation is possible. However, it is also desirable to have a small learnable class so that learning can be
achieved rapidly. Thus the design of learning machines involves selecting a position on the spectrum:
minimum error and slow learning time versus larger error and faster learning time.


Machines that have fast learning times, relatively small learnable classes, and thus relatively large
expected errors are called "realization sparse" in this paper. It is shown that many common learning
systems are of this type including signature tables, linear system models, and conjunctive normal form
expression based systems.


These studies lead to the concept of an "optimum" machine which spreads its learnable behaviors
across the behavior space in a manner to minimize the expected error. An approximation to such
optimum machines is presented and its behavior is compared to the more traditional learning machines.

Introduction

A basic characteristic of many learning systems is that they are able to acquire only a
limited class of behaviors, a class that may be far short of the set of all possible behaviors. If the
target behavior is in the learnable class, the learning machine may be able to acquire it as
training information becomes available. If the target behavior is outside of the learnable class, it will
not  be  achieved.  The  best  the  system  can  do,  in  this  case,  is  to  try  to  find  a  learnable  behavior 
close  enough  to  the  target  to  give  few  errors  and  satisfactory  if  not  perfect  behavior. 

Learnability classes may be studied in terms of basic characterizations and various
dimensions including their sizes, the amount of information or time required to learn, and the
distribution of the class across the space of all possible behaviors. It is desirable to have a large
learnable class because this increases the chances of learning an unknown function or of
approximating it accurately. However, if the error problems are not too severe, it is advantageous to
have  a  small  learnable  class  because  then  convergence  to  the  final  solution  is  much  faster.  Thus 
there  is  a  necessary  tradeoff  between  accuracy  and  rate  of  learning  which  the  system  designer 
must  understand.  Another  issue  of  interest  is  the  distribution  of  the  learnable  class  in  the  space 
of  all  possible  behaviors.  If  the  class  is  uniformly  spread  across  the  space,  randomly  selected 
behaviors  within  the  space  will  tend  to  all  be  learned  equally  well.  However,  if  the  class  is 
clustered  in  one  or  a  few  regions,  target  behaviors  within  those  regions  will  be  acquired  well  and 
others  will  be  acquired  poorly  or  not  at  all. 

This  paper  is  concerned  with  these  issues.  First  we  examine  relationships  between  the 
number  of  learnable  functions,  the  time  required  to  learn,  and  the  worst  case  and  expected  rates 
of errors. Next we define a class of learning machines which are called realization sparse
machines and show that large such machines will do only slightly better than a random coin
flipping machine on most learning problems. Then we examine several common types of
machines, specifically the signature table systems, linear models, and conjunctive normal form
machines  and  show  that  they  are  all  realization  sparse.  In  fact,  we  show  that  all  learning 
machines  that  learn  within  a  reasonable  time  are  realization  sparse.  These  studies  lead  to  the 
concept  of  an  optimal  or  minimum  expected  error  learning  machine  and  an  approximation  to 
such  machines  is  developed.  Comparisons  of  known  machines  to  optimal  or  near  optimal 
machines  are  made  to  obtain  a  measure  of  their  quality. 

In  order  to  keep  the  complexity  of  the  study  within  bounds,  the  assumed  model  is  a  binary 
function  learner  as  shown  in  Figure  1  with  p  binary  inputs.  The  various  learning  systems  to  be 
studied  are  all  adapted  to  acquire  functions  of  this  form  so  that  their  characteristics  become 
comparable.  This  paper  will  ignore  the  details  of  learning  algorithms  presented  in  the  literature 
and concentrate on examining the learnable classes and their characteristics.


Figure 1. The learning system model (diagram label: Binary Output).


The next section will present three example learning machines and formulate more
precisely the questions to be examined in this paper. Then a section will examine the nature of the
tradeoff between the expected error for a system and the time or amount of information
required to learn. The following sections will overview some results on three learning models.
Then the possibility of building a perfect minimal expected error learning machine will be
examined. Such an ideal machine has interest in its own right and serves as a comparison for other
more easily realized machines. Finally, a few summary comments conclude the paper.

Three  Example  Learning  Machines  and  Some  Questions 

This  study  begins  with  a  cursory  examination  of  three  learning  machines,  a  signature  table 
system,  a  linear  system,  and  a  conjunctive  normal  form  machine.  All  will  follow  the  model  of 
Figure  1  with  p  =4. 

The example signature table system appears in Figure 2. The constants may only have
values 0 or 1 and the output is computed by indexing through the tables starting with the
inputs at the bottom. That is, the values of the $x_i$'s determine the output values of the lower
two tables, and these values become the inputs to the top table which yields the system output.
For example, an input of $(x_1,x_2,x_3,x_4) = (0,1,1,0)$ yields outputs of $c_{01}=1$ and $c_{10}=0$ from
the lower two tables. These become the input 1,0 for the top table which gives an output of 1.
Similarly, all of the sixteen possible values for the vector $(x_1,x_2,x_3,x_4)$ can be input to this
system resulting in the total system behavior shown in Figure 5. Learning is done by varying the
values of the $c_{ij}$'s, and Samuel [14] and Biermann et al. [1] have given algorithms for doing this
type of learning.
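
The table lookup just described can be sketched in a few lines of Python. The table constants below are hypothetical (Figure 2's actual values are not reproduced in this text); they are chosen only so that the worked input (0,1,1,0) produces lower-table outputs 1 and 0 and a system output of 1. The names `LOWER_LEFT`, `LOWER_RIGHT`, `TOP`, and `behavior_vector` are likewise illustrative.

```python
from itertools import product

# Hypothetical constants, NOT the values of Figure 2.
LOWER_LEFT = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}   # indexed by (x1, x2)
LOWER_RIGHT = {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 0}  # indexed by (x3, x4)
TOP = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}          # indexed by the two lower outputs


def signature_table_output(x):
    """Index the two lower tables, then feed their outputs to the top table."""
    x1, x2, x3, x4 = x
    left = LOWER_LEFT[(x1, x2)]
    right = LOWER_RIGHT[(x3, x4)]
    return TOP[(left, right)]


def behavior_vector(system, p=4):
    """Tabulate a system's output on all 2**p inputs (one column of Figure 5)."""
    return tuple(system(x) for x in product((0, 1), repeat=p))
```

Changing any entry of the three dictionaries yields a different (or sometimes the same) 16-bit behavior vector, which is the sense in which learning "varies the constants".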


Figure  2.  A  signature  table  system. 


An example linear system appears in Figure 3. Here the coefficients $c_i$ and $\theta$ can take on
real values and the function output is 1 if $\sum_{i=1}^{p} c_i x_i \ge \theta$ and 0 otherwise. Thus for the coefficient
values given, the example input vector gives a value for $\sum_{i=1}^{p} c_i x_i$ of -1 and a
function output of 0. The total function behavior for this machine is shown in Figure 5. Again,
learning is achieved by varying the coefficients $c_i$ and $\theta$. Some algorithms for learning are given
by Minsky and Papert [8], Nilsson [10], and Samuel [13].
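
A minimal sketch of such a threshold unit follows. The coefficient values and the threshold are assumptions made for illustration, not the values of Figure 3; they are picked so that the input (0,1,1,0), assumed here to be the example input, gives a weighted sum of -1 and an output of 0. The comparison is assumed to be "greater than or equal to the threshold".

```python
# Hypothetical coefficients and threshold, NOT the values of Figure 3.
COEFFS = (2.0, 1.0, -2.0, 3.0)
THETA = 0.0


def weighted_sum(x, coeffs=COEFFS):
    """Return sum_i c_i * x_i for a binary input vector x."""
    return sum(c * xi for c, xi in zip(coeffs, x))


def linear_output(x, coeffs=COEFFS, theta=THETA):
    """Output 1 when the weighted sum reaches the threshold, else 0."""
    return 1 if weighted_sum(x, coeffs) >= theta else 0
```

Learning in this model amounts to searching over `COEFFS` and `THETA`; the sketch only evaluates a fixed setting.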


Figure 3. A linear system.

A third learning system is illustrated in Figure 4, a Boolean conjunctive normal form
acquisition machine. Here the input variables and/or their complements are combined into
every possible Boolean sum of k items or fewer (where k = 2 in this example). Then the sums
with associated coefficients $c_i$ which equal 1 are combined into a Boolean product to produce
the output. If we assume all coefficients are zero in Figure 4 except for $c_2$, $c_3$, and one further
coefficient, then the example system computes a product of three sums beginning $(x_1 + x_3)\,x_2$. Thus the example input
$(x_1,x_2,x_3,x_4)=(0,1,1,0)$ will yield $(0+1)\cdot 1\cdot(0+0)=0$. All other inputs can be similarly tabulated
as shown in Figure 5. Learning is done by adjusting the coefficients $c_i$, each of which may have
the value 0 or 1, and a learning algorithm is given by Valiant [16].
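
The construction can be sketched as follows. This is an illustrative reading, not Valiant's algorithm: it only enumerates the candidate sums and evaluates a chosen product of them. The third selected sum, $(x_1 + x_4)$, is a hypothetical stand-in picked so that the example input evaluates to $(0+1)\cdot 1\cdot(0+0)=0$; the text does not identify the actual third factor.

```python
from itertools import combinations


def all_clauses(p, k):
    """All Boolean sums of 1..k distinct literals over p variables.

    A literal is (index, negated); a clause is a tuple of literals.
    """
    literals = [(i, neg) for i in range(p) for neg in (False, True)]
    return [c for size in range(1, k + 1) for c in combinations(literals, size)]


def eval_clause(clause, x):
    """Boolean sum (OR) of the clause's literals on input x."""
    return int(any(((1 - x[i]) if neg else x[i]) for i, neg in clause))


def cnf_output(selected, x):
    """Boolean product (AND) of the selected clauses on input x."""
    return int(all(eval_clause(c, x) for c in selected))


# Selected sums: (x1 + x3), x2, and a hypothetical third sum (x1 + x4).
EXAMPLE = [((0, False), (2, False)), ((1, False),), ((0, False), (3, False))]
```

Setting a coefficient to 1 corresponds to including the matching clause in `selected`.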


Figure 4. A Boolean conjunctive normal form machine.

Figure 5 thus shows, for each of three kinds of learning machines, a learnable behavior.
The  16  bit  vectors  which  are  columns  in  this  table  are  called  learnable  vectors.  Each  type  of 
machine  has  a  set  of  possible  learnable  vectors  which  can  be  achieved  by  varying  its  constants. 
These  sets  of  learnable  vectors  are  called  learnable  classes  and  we  will  use  the  symbol  L  to 
designate  the  size  of  such  a  set  for  a  given  machine.  Usually  L  falls  far  short  of  the  total 
number  of  possible  such  vectors. 

When  the  learnable  vectors  do  not  cover  the  whole  space,  the  learning  machine  may  be 
asked  to  acquire  a  behavior  outside  its  class.  In  this  case,  it  can  select  a  vector  with  minimum 
Hamming  distance  to  that  target  vector.  It  may  have  difficulty  finding  a  learnable  vector  close 
to  the  target  if  its  learnable  vectors  are  clustered  as  shown  in  Figure  6(a).  On  the  other  hand, 
the  learnable  vectors  may  be  spread  across  the  space  uniformly  as  in  Figure  6(b)  so  that  every 
possible  target  vector  will  be  near  a  learnable  vector.  If  the  learnable  class  is  spread  so  as  to 
minimize the expected Hamming distance [4] from a randomly selected target vector to the
closest  learnable  vector,  then  the  learning  machine  will  be  called  optimum. 
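
The effect of clustering on expected error can be checked by brute force for small vector lengths. The sketch below is only illustrative: the two learnable classes in the usage note are made-up examples, and `expected_distance` enumerates all $2^n$ targets, so it is practical only for small n.

```python
from itertools import product


def hamming(u, v):
    """Number of positions in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def nearest(target, learnable):
    """A learnable vector at minimum Hamming distance from the target."""
    return min(learnable, key=lambda v: hamming(v, target))


def expected_distance(learnable, n):
    """Mean distance from a uniformly random n-bit target to its nearest learnable vector."""
    total = sum(hamming(nearest(t, learnable), t) for t in product((0, 1), repeat=n))
    return total / 2 ** n
```

For n = 4 and L = 2, the spread class {0000, 1111} gives an expected distance of 1.25, while the clustered class {0000, 0001} gives 1.5.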

The  following  questions  will  be  examined  here: 

1. What are the general relationships for all learning machines between the number of
learnable functions, the number of examples required to learn, and the rates of errors?

2.  What  is  the  nature  of  the  learnable  class  for  each  type  of  machine? 

3.  If  a  target  behavior  is  selected  randomly  from  the  space  of  all  possible  behaviors,  what  is 
the  probability  that  the  learning  machine  will  be  able  to  acquire  that  behavior? 

4. If a target behavior is not learnable but the machine adapts to give the closest possible
behavior,  what  will  be  the  expected  number  of  errors? 

5.  What  is  the  nature  of  an  optimal  learning  machine  which  has  its  learnable  class  spread 
across  the  behavior  space  to  minimize  expected  error? 
6. How do the known learning machines compare with this optimal machine?


(a) A clustered distribution with high expected error.
(b) A uniformly distributed class with minimum expected error.

Figure 6. Two possible distributions for L = 10 learnable
vectors (designated by X's) in a behavior space.

Relating  the  Number  of  Learnable  Behaviors  to  Error  and  Time  to  Learn 

In  general,  as  the  number  L  of  learnable  vectors  increases,  the  expected  distance  from  a 
randomly  chosen  vector  to  its  nearest  learnable  vector  decreases  and  the  expected  time  to  learn 
increases.  These  relationships  are  examined  in  this  section.  Specifically,  a  bounding  formula  is 
found  showing  the  relationship  between  the  number  of  examples  required  to  learn  and  the  worst 
case  error  that  must  occur.  Second,  it  is  shown  that  the  expected  error  is  near  the  worst  case 
error  so  that  these  worst  case  results  cannot  be  taken  lightly.  Worst  case  behavior  is  not 
obscure behavior; it is typical. Finally, an experimental simulation is described that shows how
the  time  required  to  learn  is  related  to  L. 

Relating L with Worst Case Error. Let $S_0$ be the set of vectors which are as near or nearer
to a given learnable vector $v_0$ as to any other learnable vector. That is, $S_0$ is the set of possible
target vectors that would result in the selection of $v_0$ as the most desirable learned behavior.
We will say that $S_0$ is the set of vectors covered by $v_0$. If the worst case error in $S_0$ is D (i.e.,
the vector in $S_0$ with greatest Hamming distance from $v_0$ differs from $v_0$ in D positions), then
the number of vectors in $S_0$ is no more than $\sum_{i=0}^{D}\binom{n}{i}$ where n is the vector length. (There are $\binom{n}{i}$
vectors at distance i from $v_0$ for each $i = 0,1,2,\ldots,D$.)

Suppose all of the L learnable vectors cover sets with worst case error D, then the set of
vectors covered by them all is no greater than $L\sum_{i=0}^{D}\binom{n}{i}$. The total number of such vectors is
$2^n$ so

$$L \sum_{i=0}^{D} \binom{n}{i} \ge 2^n \qquad (1)$$

This inequality relates L and D and often provides useful information. For example, if in an
application someone has specified a maximum allowed worst case error D, then the parameters
of the learning machine must be adjusted to make L large enough. If the learnable vectors of
the learning machine are widely dispersed as in Figure 6(b), then (1) will give a good estimate of
the required value of L. If the learnable vectors are clustered (Figure 6(a)), then (1) will give
only a distant lower bound for the required value of L. Also it is clear that at least $\log_2 L$
observations must be made in order to differentiate the final learned behavior from the L
alternatives. So (from (1) above) the number of input-output observations needed in order to
guarantee a learned behavior with no more than D errors is no lower than $n - \log_2 \sum_{i=0}^{D}\binom{n}{i}$.
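
Inequality (1) is easy to evaluate numerically. The following sketch, with illustrative function names, finds the smallest D compatible with a given L and n, and the matching lower bound on the number of observations. The sample values in the usage note are arbitrary inputs used only to exercise the functions.

```python
from math import comb, log2


def ball_size(n, d):
    """Number of n-bit vectors within Hamming distance d of a fixed vector."""
    return sum(comb(n, i) for i in range(d + 1))


def min_worst_case_error(n, L):
    """Smallest D for which L * sum_{i<=D} C(n, i) >= 2**n, i.e. (1) holds."""
    d = 0
    while L * ball_size(n, d) < 2 ** n:
        d += 1
    return d


def min_observations(n, d):
    """Lower bound n - log2(sum_{i<=D} C(n, i)) on observations for worst case error D."""
    return n - log2(ball_size(n, d))
```

For instance, with n = 16 and L = 520 the smallest admissible D is 2, and with n = 23 and D = 3 at least 12 observations are needed.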
Relationship (1) can be put into more understandable form using an approximation given
by Hamming [4], page 165. If we define $\lambda$ to be $\lambda = D/n$, then under the condition

$$n \ge \frac{1-\lambda}{2\pi\lambda(1-2\lambda)^2}$$

the following relationship holds:

$$\sum_{i=0}^{D} \binom{n}{i} \le 2^{nH(\lambda)} \qquad (2)$$

where $H(\lambda) = -\log_2\left(\lambda^{\lambda}(1-\lambda)^{1-\lambda}\right)$. Substituting (2) into (1) and rearranging yields a form whose
trends are clear for large n.

$$\frac{\log_2 L}{n} \ge 1 - H(\lambda) \qquad (3)$$

If the allowed percentage of errors $\lambda$ is to be small, then $H(\lambda)$ will be small also. This means
that the number $\log_2 L$ of observations needed to learn needs to be nearly as large as n, the
total number of values of the function.

As an example, suppose a designer is building a game playing program with 30 3-bit
features and requires learning to achieve 95 percent accuracy. Then, writing n for the total
number of function values, the learning machine will require at least

$$n\,(1 - 0.28) = 0.72\,n$$

examples to guarantee that the required accuracy will be met. In other words, the learning
machine will have to observe 72 percent of all function values to achieve the specified accuracy.
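
The 72 percent figure can be reproduced from relationship (3) with the entropy function taken in base 2, which is the reading assumed here. The function names are illustrative.

```python
from math import log2


def binary_entropy(lam):
    """H(lam) = -log2(lam**lam * (1 - lam)**(1 - lam)), with H(0) = H(1) = 0."""
    if lam in (0.0, 1.0):
        return 0.0
    return -(lam * log2(lam) + (1 - lam) * log2(1 - lam))


def min_fraction_observed(lam):
    """Right side of (3): the fraction of function values that must be observed."""
    return 1.0 - binary_entropy(lam)
```

With an allowed error fraction of 0.05, `min_fraction_observed(0.05)` is about 0.71, in line with the rounded value 1 - 0.28 = 0.72 used in the example.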

Suppose we have a class of learning machines, such as the signature tables or perceptrons,
with the property that there is a member of the class for each integer p. Then one can define
$L_p$ to be the number of learnable functions for the member of the class with p inputs. Then (3)
can be rewritten as

$$\frac{\log_2 L_p}{2^p} \ge 1 - H(\lambda) \qquad (4)$$

We  will  say  a  class  of  learning  machines  is  realization  sparse  if  the  left  side  of  (4)  approaches 
zero  as  p  becomes  large. 

The minimum number $\log_2(L_p)$ of observations required to learn, as specified by (4), varies
as shown in Figure 7. This curve shows the tradeoff between low error with many learning
examples and high error with few. This relationship is a fundamental law for all machines that
match the model. Realization sparse classes have the property that $\log_2(L_p)$ is very small
compared to $2^p$ if p is large and this leads to worst case errors near 50 percent. It is a basic result
of this paper that many if not most of the commonly studied learning machines are realization
sparse.


Figure 7. Minimum bound on the number of observations required to learn
versus the fraction $\lambda$ of allowed errors. (Axes: the number of observations
required to learn; fraction of allowed errors.)


Relating L with Expected Error. The above observations relate to the worst case error that
could occur if the target function is chosen randomly from the space of all possible functions.
One can hope that the expected error will be considerably less. In fact, the expected error will
be greater than two thirds the worst case error D as can be seen from the following argument.

In order to get a lower bound on the expected error, assume the n-space is divided into L
spheres that surround learnable vectors where each sphere has diameter D. A less uniform
distribution of learnable vectors would yield a higher expected error so this assumption is
acceptable when computing a lower bound. There will be $\binom{n}{i}$ vectors at radius i from each $v_0$ for each
i. This number can be represented for large n by the normal distribution as shown in Figure 8.
It is clear that the expected value for a normal distribution is greater than that for, say, the
linear approximation shown which has an expected value of $\frac{2}{3}D$. (We assume that D is low
enough to be below the inflexion point on the normal curve.)

This is the desired result; the expected value for error is at least 2/3 as large as the worst case examined
above, so the graph of Figure 7 gives a good approximation to typical results as well as worst
case results.

Figure 8. The expected error b will be greater than a = (2/3)D. (Curve label: linear
approximation to normal curve.)

A more accurate lower bound to expected error can be computed by accounting precisely
for all of the $2^n$ vectors. If D is the worst case error for $v_0$, the sphere of vectors at distance
D-1 and less from $v_0$ will be completely full, and as few as possible will be at distance D from
$v_0$. Remember that this means that for a given L, D is the lowest integer such that (1) holds.
With this assumption (which is not realizable for every L), one can compute the expected error.
It provides a lower bound for any realizable expected error.


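$$
\begin{aligned}
\text{expected error lower bound}
&= \frac{\sum_{i=0}^{D} i \times (\text{number of vectors at distance } i)}{2^n} \\
&= \frac{\sum_{i=0}^{D-1} i \times (\text{number of vectors at distance } i) + D \times (\text{number of vectors at distance } D)}{2^n} \\
&= \frac{\sum_{i=0}^{D-1} i \times (\text{number of vectors at distance } i) + D \times \left(2^n - (\text{number of vectors below distance } D)\right)}{2^n} \\
&= \frac{L\sum_{i=0}^{D-1} i\binom{n}{i} + D\left(2^n - L\sum_{i=0}^{D-1}\binom{n}{i}\right)}{2^n} \qquad (5)
\end{aligned}
$$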


This  lower  bound  is  used  as  a  comparison  figure  for  various  machines  later  in  the  paper. 
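
A direct evaluation of this bound is sketched below. It assumes the reading given in the text: each of the L spheres is completely full out to distance D-1, every remaining vector sits at distance exactly D, and D is the smallest integer satisfying (1). The function name is illustrative, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from math import comb


def expected_error_lower_bound(n, L):
    """Lower bound on expected Hamming error for L learnable vectors of length n."""
    total = 2 ** n
    # Smallest D for which inequality (1) holds.
    D = 0
    while L * sum(comb(n, i) for i in range(D + 1)) < total:
        D += 1
    # Vectors strictly inside the spheres, and the error they contribute.
    inner_count = L * sum(comb(n, i) for i in range(D))
    inner_error = L * sum(i * comb(n, i) for i in range(D))
    # Every vector not accounted for is placed at distance exactly D.
    return Fraction(inner_error + D * (total - inner_count), total)
```

For n = 23 and L = 4096 the bound evaluates to 2921/1024, roughly 2.85 errors against a worst case of 3.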

The relationship (1) is often mentioned in studies of coding theory where the goal is to
select codes (vectors) which are maximally distant from each other in order to maximize
correctability. (See [4,11].) Incoming vectors which are near a legal code (using Hamming distance)
are assumed to be erroneous versions of that code and are corrected to it. An interesting study
in coding theory relates to "perfect codes" in which relationship (1) holds as an equality:

$$L \sum_{i=0}^{D} \binom{n}{i} = 2^n$$

For example, a solution for this equation is provided by the Golay code ([11], page 70) where
$n = 23$, $D = 3$, and $L = 2^{12}$ and the associated code is well known. In the current domain, this
corresponds to having 4096 learnable vectors of length 23 with every one of the $2^{23}$ = 8,388,608
possible behaviors being within 3 or less of some learnable vector. Thus learning 12 bits of
information (out of 23) is enough to reduce the probability of error to no more than 3/23
assuming all inputs are equally likely. (The reader might wish to ponder the seeming
contradiction that learning 12 bits apparently results in "knowing" 20 bits.)

Relating L to Time to Learn. It is difficult to compute the time required to learn in
comparison with L because it depends on the learning algorithm used, the particular behavior that
is to be acquired, and the information available about the target behavior. Suppose the
learnable vectors are $v_1, v_2, \ldots, v_L$ and that the target vector is $v_T$. Suppose further that learning is
to be done by observing randomly selected input-output pairs. That is, random positions in $v_T$
are observed during learning and the best learnable vector at any given time is selected knowing
only these few entries in $v_T$. Let s denote the set of inputs for which the output has been
observed and let $|v|_s$ denote vector v with all entries corresponding to inputs not in s set to
zero. For example, if $p = 2$, $v_T$ is as shown

    input    v_T
    0 0      1
    0 1      0
    1 0      1
    1 1      1


and it has been observed during learning that 0 0 yields 1 and 1 0 yields 1, then $s = \{00, 10\}$ and
$|v_T|_s = (1,0,1,0)$. One could propose that the learning algorithm should find the first $v_i$ in the
learnable class such that the Hamming distance between $|v_T|_s$ and $|v_i|_s$ is minimized. If
we agree that $v_T$ is learned when this learning algorithm selects a learnable vector and never
again changes its guess, then Figure 9 shows for each time t the probability that $v_T$ is learned
as a function of L for n = 4. Figure 9 assumes that an optimum set of L vectors is to be used
(where "optimum" is defined in the previous section). It indicates the way the rate of learning
varies with L for such an optimum machine and other algorithms will be no faster. The
expected decrease in learning speed is observed as L increases.
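
The selection rule proposed above can be sketched as follows. Positions 0 to 3 of a vector stand for the inputs 00, 01, 10, 11, and the observed set s is given as a set of positions; both conventions, the function names, and the three-vector learnable class in the usage note are assumptions made for illustration.

```python
def masked(v, s):
    """|v|_s: keep the entries of v at observed positions s, zero the rest."""
    return tuple(b if i in s else 0 for i, b in enumerate(v))


def hamming(u, v):
    """Number of positions in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def current_guess(learnable, target, s):
    """First learnable vector minimizing the masked Hamming distance to the target."""
    t = masked(target, s)
    # min() returns the first minimizer in list order.
    return min(learnable, key=lambda v: hamming(masked(v, s), t))
```

With target (1,0,1,1) and observations at positions {0, 2}, a class listed as (0,0,0,0), (1,1,1,0), (1,0,1,1) yields the guess (1,1,1,0); once all four positions are observed the guess becomes (1,0,1,1) and no longer changes.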


Figure  9.  Probabili 

Signature Tables

One  of  the  goals  of  the  paper  is  to  examine  properties  common  to  all  learning  systems  as 
was done in the previous section. Another goal is to look at properties of specific learning
systems such as signature tables, linear systems, and conjunctive normal form systems. This
section begins the latter task by examining the characteristics of signature tables. Specifically, we
would  like  to  know 

(a)  what  is  the  nature  of  the  learnable  vectors  for  signature  table  systems,  and 

(b)  how  many  learnable  vectors  are  there  for  given  table  configurations. 

The  answer  to  (b)  will  make  it  possible  to  find  the  probability  that  an  unknown  target  behavior 
will  be  learnable  by  such  systems.  It  will  also  give  a  way  of  estimating  worst  case  error  using 
(1)  from  the  previous  section. 

The characterization of the learnable functions. The nature of the learnable vector will be
studied for the type of signature table system T as shown in Figure 10. This system is best
understood by constructing the matrix M(f,S) where f is the function that is computed by T and
S is the subset of the inputs to T that feed into a single table in T (Biermann et al. [1]).
M(f,S) is built by constructing one row for every possible value of the inputs S and one column
for every possible value of the inputs not in S. The value of an entry in row i and column j of
M(f,S) is the value of f when f receives input values in S associated with row i and the input
values not in S associated with column j. Thus in the case of Figure 10, the value of f
corresponding to input $(X_1,X_2,X_3,X_4,X_5,X_6) = (0,0,1,0,1,0)$ is 1. Let $S = \{X_1,X_2,X_3\}$ and
construct M(f,S). Then the entry in the row labelled $(X_1,X_2,X_3) = (0,0,1)$ and the column
labelled $(X_4,X_5,X_6) = (0,1,0)$ should be 1. Similarly, all other entries in M(f,S) can be
constructed as shown in Figure 10.


Figure 10. An example table system T with its matrix M(T).

The theorem is that f can be realized by T if and only if for every S that comprises the
inputs to a table in T, the size of the output alphabet in that table is at least as large as the
number of distinct rows in M(f,S). In Figure 10, this means that since $f_1$ can take on only
three values (0,1,2), $M(f,\{X_1,X_2,X_3\})$ must have three or fewer distinct rows. Furthermore
$M(f,\{X_4,X_5,X_6\})$ can have only three or fewer distinct rows because $f_2$ can have only three values.
These limitations force a kind of repetitiveness on any realizable function f, and they precisely
specify the set of functions that such a system can realize.
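
The row-counting test in the theorem can be sketched as below. Inputs are indexed from 0, S is a list of input positions, and the two three-input functions are hypothetical examples rather than the function of Figure 10.

```python
from itertools import product


def decomposition_matrix(f, p, S):
    """M(f, S): one row per setting of the inputs in S, one column per setting of the rest."""
    rest = [i for i in range(p) if i not in S]
    rows = []
    for s_vals in product((0, 1), repeat=len(S)):
        row = []
        for r_vals in product((0, 1), repeat=len(rest)):
            x = [0] * p
            for i, b in zip(S, s_vals):
                x[i] = b
            for i, b in zip(rest, r_vals):
                x[i] = b
            row.append(f(tuple(x)))
        rows.append(tuple(row))
    return rows


def distinct_rows(f, p, S):
    """Number of distinct rows of M(f, S); a table fed by S needs at least this many output symbols."""
    return len(set(decomposition_matrix(f, p, S)))


def xor_and(x):
    """Hypothetical example: (x0 XOR x1) AND x2."""
    return (x[0] ^ x[1]) & x[2]


def mux(x):
    """Hypothetical example: x2 selects between x0 and x1."""
    return (x[0] & x[2]) | (x[1] & (1 - x[2]))
```

Here `xor_and` has two distinct rows for S = [0, 1], so a binary table over those inputs suffices, while `mux` has four and would need a four-symbol alphabet for the same table.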

If the limitations on output alphabet size are relaxed, larger classes of functions can be
realized but the tables quickly become very large. In the extreme case where there are no
limitations on alphabet size, the table becomes simply an enumeration of every possible input and
its associated output and it can realize any function.

The characterization holds for table systems of any depth. For example, consider Figure
11 and let $v_i$ represent the alphabet size for $f_i$. Then the row multiplicity of $M(f,\{x_1,x_2\})$ is
$v_3$, of $M(f,\{x_3,x_4\})$ is $v_4$, of $M(f,\{x_1,x_2,x_3,x_4\})$ is $v_1$, and so forth. A function can be
represented by a signature table system if and only if the kind of repetitiveness observed here
occurs in f.

Figure 11. A three level signature table system.

Other properties of signature tables have been given by Biermann et al. [1]. These
results are related to studies on switching circuit decomposition as presented, for example, by
Curtis [3].

Counting the number of learnable functions. A methodology for computing the number of
functions realizable by a given signature table system is given by Biermann et al. [1] and an
improved version appears as Appendix A in this paper. As an example, the number of functions
realizable by table systems of the form in Figure 10 if $f_1$ and $f_2$ are binary can be computed
for p inputs assuming $p_1 = p/2$ of the inputs are to the lower left table and $p_2 = p/2$
inputs are to the lower right table where p is even. If p is odd, then the two lower tables
receive, respectively, $p_1 = (p+1)/2$ and $p_2 = (p-1)/2$ inputs. In fact, the number of functions is
given by

$$L_p = 5\cdot 2^{\,2^{p_1}+2^{p_2}-1} - 2^{\,2^{p_1}+2} - 2^{\,2^{p_2}+2} + 8 .$$
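
The count can be cross-checked by brute force for small p. The sketch below assumes the closed form $L_p = \tfrac{5}{2}ab - 4a - 4b + 8$ with $a = 2^{2^{p_1}}$ and $b = 2^{2^{p_2}}$, which reproduces the tabulated counts for this two-level binary system, and compares it with a direct enumeration of every choice of the two lower tables and the top table. Function names are illustrative.

```python
from itertools import product


def count_realizable(p1, p2):
    """Closed-form count of functions f(x, y) = top(g(x), h(y)) with binary g, h, top."""
    a = 2 ** (2 ** p1)  # number of Boolean functions of the p1 left inputs
    b = 2 ** (2 ** p2)  # number of Boolean functions of the p2 right inputs
    return (5 * a * b) // 2 - 4 * a - 4 * b + 8


def brute_force_count(p1, p2):
    """Enumerate all table settings and count the distinct behavior vectors."""
    nx, ny = 2 ** p1, 2 ** p2
    seen = set()
    for g in product((0, 1), repeat=nx):          # lower left table
        for h in product((0, 1), repeat=ny):      # lower right table
            for top in product((0, 1), repeat=4):  # top table
                seen.add(tuple(top[2 * g[i] + h[j]]
                               for i in range(nx) for j in range(ny)))
    return len(seen)
```

The brute-force version grows doubly exponentially and is only usable for p up to about 4; it is included purely as a sanity check on the closed form.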


Returning to the definitions given above, it is easy to check that this class of machines is
realization sparse. Therefore, as p increases, we can expect $L_p$ to become small with respect to
the number of functions $2^{2^p}$ and the worst case error to approach 50 percent. The following
table gives an evaluation for these quantities and shows how quickly the trend becomes
apparent.


Number      Number of       Total number     Lower bound      Number of   Percent
of inputs   realizable      of functions     on worst case    function    error
p           functions L                      error D          values

1           4               4                0                2           0
2           16              16               0                4           0
3           88              256              1                8           12
4           520             65536            2                16          12
5           9160            4294967296       6                32          19
6           161800          1.844 x 10^19    15               64          23
7           41679880        3.403 x 10^38    34               128         27
8           10736893960     1.158 x 10^77    78               256

If deeper trees are examined as shown in Figure 11, for example, the trend is even worse.
Consider the set of binary trees of depth d where d is the number of levels of tables. (Thus
d = 3 in Figure 11.) Assume that all tables are restricted to vocabulary sizes of two or less.
Then it is shown in Appendix A that the ratio of the number of realizable functions to possible

functions  is  less  than  l/(3(2"'  ))  if  d  is  greater  than  2. 

In conclusion, signature table systems can realize functions that have the kind of
repetitiveness described here: The M(f,S) matrices can have only as many distinct rows as there
are alphabet symbols in the associated table. The counting results indicate that such table
systems can realize relatively few functions. Most classes of signature table systems are realization
sparse. 

Linear  Models 

Linear models have been discussed extensively in the literature beginning with the studies
of threshold functions (Muroga [9] and Nilsson [10]) and perceptrons (Minsky and Papert [8]),
continuing with linear evaluation methodologies in applications such as game playing (Samuel
[13]), and including more recently studies in connectionist learning (Rumelhart et al. [12],
McClelland et al. [5], Sejnowski [15]). Characterizations of the learnable classes of functions
have been given in terms of linear separability of points in a p-dimensional space (Muroga [9],
Nilsson [10]) and in terms of computability of geometric properties on a two dimensional grid
(Minsky and Papert [8]). A characterization has also been given in terms of function
monotonicity as will be described here (from Muroga [9]).

Consider two inputs to a linear function f as described in the second section while all other
inputs remain constant. Suppose these inputs take the values 00 and then 01 and that the
output from f simultaneously goes from 0 to 1. Then one can conclude that the weight on the bit
that changed is positive. Given that this weight is positive, one can be sure that on inputs 10
and 11, there cannot be a transition in the output from a 1 to a 0 since this would require a
negative weight. A generalization of this idea results in the concept of a completely monotone
function: Consider for any function f the outputs $f(X_A)$ and $f(X_{\bar{A}})$ where the subscript A
represents a particular setting of certain input bits in vector X and subscript $\bar{A}$ represents the
opposite setting, with all other bits in vector X remaining the same. If for all input vectors X
the transition of outputs for the input transition $X_A$ to $X_{\bar{A}}$ is in the same direction or makes
no change, then the function is considered monotone with respect to A. If the function is
monotone with respect to all A, then it is completely monotone.

The characterization as explained in Muroga [9] is that every linear function is completely
monotone. Furthermore, every completely monotone function with $p \le 8$ is realizable as a
linear function. However, there exist completely monotone functions with p=9 which are not
realizable.

Concerning the number $L_p$ of realizable functions using the linear model, no formula has
been discovered for doing this computation. However, it is known (Muroga [9]) that for linear
functions

$$L_p < 2^{p^2}$$

which leads to the conclusion that this class is also realization sparse.
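For very small p the number of linear (threshold) functions can be found by brute force. The sketch below is only an illustration: the function name, the restriction to small integer weights, and the half-integer thresholds are all assumptions made here, and for a given weight range the result is merely a lower bound on the true count.

```python
from itertools import product

def count_threshold_functions(p: int, max_weight: int) -> int:
    """Count distinct truth tables of f(x) = [w.x > t] with small integer weights."""
    inputs = list(product((0, 1), repeat=p))
    weights = range(-max_weight, max_weight + 1)
    # Half-integer thresholds separate every pair of attainable integer sums.
    bound = p * max_weight
    thresholds = [t + 0.5 for t in range(-bound - 1, bound + 1)]
    tables = set()
    for w in product(weights, repeat=p):
        sums = [sum(wi * xi for wi, xi in zip(w, x)) for x in inputs]
        for t in thresholds:
            tables.add(tuple(int(s > t) for s in sums))
    return len(tables)

print(count_threshold_functions(1, 1))  # 4
print(count_threshold_functions(2, 1))  # 14: everything except XOR and its complement
```

With p=2 the two missing functions are exclusive-or and its complement, which no choice of weights can separate.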

Thus  this  class  can  also  be  expected  to  exhibit  error  rates  approaching  50  percent  as  p 
becomes  large.  The  effect  can  be  observed  in  the  following  table  which  was  compiled  from 
figures  given  for  Lp  by  Muroga  [9],  page  821. 
Number of   Number of             Total number     Lower bound on   Number of   Percent
inputs p    realizable            of functions     worst case       function    error
            functions L                            error D          values

1           4                     4                0                2           0
2           14                    16               1                4           25
3           104                   256              1                8           12
4           1882                  65536            2                16          12
5           94572                 4,294,967,296    5                32          16
6           15,028,134            1.844 x 10^19    12               64          19
7           8,378,070,864         3.403 x 10^38    29               128         23
8           17,561,539,552,946    1.158 x 10^77    70               256
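The D column can be recomputed from L if relationship (1), which is stated earlier in the paper, is read as the usual sphere-covering condition: D is the smallest radius at which L Hamming balls could cover all 2^n output vectors of length n = 2^p. That reading is an assumption made for this sketch, as is the function name, although it agrees with the rows checked here.

```python
from math import comb

def worst_case_error_lower_bound(n: int, num_learnable: int) -> int:
    """Smallest D with num_learnable * sum_{i<=D} C(n, i) >= 2**n."""
    total = 2 ** n
    ball = 0  # number of vectors within distance d of a fixed vector
    for d in range(n + 1):
        ball += comb(n, d)
        if num_learnable * ball >= total:
            return d
    raise ValueError("num_learnable must be at least 1")

# (p, L) pairs taken from the linear-model table.
for p, L in [(1, 4), (2, 14), (3, 104), (4, 1882), (5, 94572), (6, 15028134)]:
    n = 2 ** p
    D = worst_case_error_lower_bound(n, L)
    print(p, L, D, f"{100 * D / n:.1f}%")
```

The percent error column is then D divided by the number of function values n.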
Conjunctive Normal Form Expressions


Valiant [16] has studied the problem of learning k-conjunctive normal form Boolean
expressions which have the form of a product of sums of k (or fewer) input variables. The model
assumes examples of the behavior are presented according to a probability distribution and the
major result is that if the target function is learnable, it will be identified with high probability
using only a number of examples that is polynomial in p.

The class of functions realized by such expressions may be visualized as those functions
which yield 1 on the intersection of a number of regions on a Venn diagram of p dimensions where
each region is the union of k (or fewer) input variables. If k=p then every possible function on
p variables can be realized. Typically k will be less than p to reduce memory requirements and
the time needed to learn.

As  with  signature  tables  and  threshold  functions,  the  conjunctive  normal  form  expressions 
form  a  realization  sparse  class.  This  can  be  seen  by  the  following  argument: 

The number of conjuncts of length k on p input variables and their negations is $(2p)^k$.
(Conjuncts of length less than k need not be considered because each one can be represented by
a product of k length conjuncts. Thus, if k=2 one can represent $x_1$ as $(x_1 + x_2)(x_1 + \bar{x}_2)$.) The
number of different conjunctions is the number of subsets of $(2p)^k$ objects, $2^{(2p)^k}$. Not all of
these conjunctions yield unique functions so the number of functions is less than this:

$$L_p < 2^{(2p)^k}.$$

Applying the definition given above, we see that this class is also realization sparse.
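The gap between the counting bound and the true number of k-CNF functions can be seen by enumeration for tiny p. The following is a sketch under assumptions made here: truth tables are stored as bitmasks, the empty product is taken to be the constant 1, and the function name is illustrative.

```python
from itertools import combinations_with_replacement, product

def count_kcnf_functions(p: int, k: int) -> int:
    """Count distinct Boolean functions expressible as a product of sums of at most k literals."""
    inputs = list(product((0, 1), repeat=p))
    full = (1 << len(inputs)) - 1  # truth table of the constant-1 function

    def literal_mask(var: int, negated: bool) -> int:
        # Bit `row` is set when the literal is true on input row `row`.
        mask = 0
        for row, x in enumerate(inputs):
            if x[var] != int(negated):
                mask |= 1 << row
        return mask

    literals = [literal_mask(v, neg) for v in range(p) for neg in (False, True)]
    # Repeating a literal inside a sum also produces the shorter sums.
    clauses = set()
    for combo in combinations_with_replacement(literals, k):
        m = 0
        for lit in combo:
            m |= lit
        clauses.add(m)

    # Close {constant 1} under AND with each clause: this visits every subset of clauses.
    functions = {full}
    for clause in clauses:
        functions |= {f & clause for f in functions}
    return len(functions)

for p, k in [(2, 1), (2, 2), (3, 1), (3, 3)]:
    print(p, k, count_kcnf_functions(p, k), 2 ** ((2 * p) ** k))
```

For p=2 and k=1 only 10 of the 16 possible functions are obtained, while k=p recovers all of them, as the text states.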

In  Search  of  the  Optimal  Learning  Machine:  Truncation  Machines 

The above three sections describe the nature and cardinality of learnable classes for three
kinds of learning machines. The question arises as to what learnable set of behaviors would be
most desirable if one could specify them precisely. The measure to be considered here is the
expected error rate assuming that the target behavior is selected from the set of all possible
behaviors with each behavior being equiprobable and with all inputs to the target behavior
being equiprobable. In other words, the following problem is being posed: Find a set of L
behaviors which will be the learnable set for the proposed machine and which will have the
property that no other set of L vectors will have lower expected error (assuming uniform
distribution on the target behaviors as mentioned above). Two solutions will be proposed to this
problem, the truncation machine described here and the G1 machine described in the next section.

One obvious way to design the class of learnable vectors is to segment them into subvectors
of some fixed length, say q, and require that all outputs within a segment be the same:

|  xxxxxx  |  xxxxxx  |  . . .  |  xxxxxx  |
                                |<-- q -->|

As an example, suppose p=4 so that output vector length is 16. If q=4, the following sixteen
vectors would compose the learnable set:

1.   0000000000000000
2.   0000000000001111
3.   0000000011110000
4.   0000000011111111
5.   0000111100000000
6.   0000111100001111
7.   0000111111110000
8.   0000111111111111
9.   1111000000000000
10.  1111000000001111
11.  1111000011110000
12.  1111000011111111
13.  1111111100000000
14.  1111111100001111
15.  1111111111110000
16.  1111111111111111


It would appear that this set of L = 16 vectors uniformly spans the space of all possible vectors
of length 16 and that matching a target vector could be done optimally and in a straightforward
manner. Suppose, as an example, that the target behavior is

0010000011010011.

Then the learning machine could approximate this behavior with this vector:

0000000011110000

And the rate of error would be 4 out of 16 or 25 percent. That is, each segment of q values is
set to either 0 or 1 to match the majority of values in the target vector in that segment. If
there are an equal number of 0's and 1's in the target vector for a given segment, as occurs in
the rightmost segment in the above example target, the learning machine can place either 0's or
1's in that segment with the same expected error rate.
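The majority rule just described is short enough to write out. In this sketch the function names are illustrative, vectors are held as strings of 0 and 1, and ties are broken toward 0, which is one of the two equally good choices mentioned in the text.

```python
def truncation_approximation(target: str, q: int) -> str:
    """Replace each length-q segment by its majority bit (ties go to 0)."""
    if len(target) % q != 0:
        raise ValueError("q must divide the vector length")
    out = []
    for start in range(0, len(target), q):
        segment = target[start:start + q]
        ones = segment.count("1")
        out.append(("1" if 2 * ones > q else "0") * q)
    return "".join(out)

def hamming(a: str, b: str) -> int:
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

target = "0010000011010011"
approx = truncation_approximation(target, 4)
print(approx, hamming(target, approx))  # 0000000011110000 4
```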

The next question is whether one can build a machine with this class of learnable vectors.
If $q = 2^r$ for some integer r, then such a machine is straightforward to build. Examining these
learnable vectors, one can see that the output is independent of the rightmost r bits of the input
and completely dependent on the remaining p-r inputs. So the desired learning machine is as
shown in Figure 12, a system which ignores (or truncates) r inputs and keeps a table to specify
the output on the basis of the remaining inputs. Thus this is called the truncation machine.


Figure 12. The truncation machine.

Next the characteristics of the truncation machine should be examined. The first
observation is that the worst case error is n/2. By filling each segment with q/2 0's and q/2 1's, one
can build the worst possible target vector for this machine. For the above example it would be:


0011001100110011

A worst case error of n/2 is a bad omen because most of the machines examined earlier had
better worst case error characteristics.


What about the expected error for randomly selected target vectors? This can be
computed by the formula

$$\text{expected error} = \frac{n}{2}\left[1 - \frac{\binom{q}{q/2}}{2^{q}}\right]$$

which is derived in Appendix B. It turns out that this is not nearly optimal. In fact, in the
case n = 16, q=4, the average error is 5.0 whereas a set of L=16 vectors has been found with an
average error of 4.27.
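The quoted average of 5.0 can be checked directly, since each of the n/q segments independently loses min(k, q-k) bits when it contains k ones. The sketch below computes that mean exactly and compares it with a closed form for even q; the closed form is derived here from the binomial distribution rather than copied from Appendix B, so treat it as an assumption, and the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def truncation_expected_error(n: int, q: int) -> Fraction:
    """Exact mean error: each of the n/q segments loses min(k, q - k) bits."""
    per_segment = sum(
        Fraction(comb(q, k) * min(k, q - k), 2 ** q) for k in range(q + 1)
    )
    return (n // q) * per_segment

def closed_form(n: int, q: int) -> Fraction:
    """(n/2) * (1 - C(q, q/2) / 2**q), valid for even q."""
    return Fraction(n, 2) * (1 - Fraction(comb(q, q // 2), 2 ** q))

print(truncation_expected_error(16, 4), closed_form(16, 4))  # 5 5
```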

As will be seen below, the truncation machine is not even as good as the previously
discussed machines from the point of view of minimum expected error. So an initial attempt to
build a machine with low expected error has failed. The next section shows a more
sophisticated approach.


In Search of the Optimal Learning Machine: The G1 Machine

If the uniform set of learnable vectors given above for the truncation machine is not
distributed to achieve minimum expected error, what are the characteristics of such an optimum set?
An example of a near optimal set was found in our studies and is given here, a set of L=16
vectors of length n=16 with average expected error of 4.27. That is, a randomly selected vector
from n=16 space will be, on the average, a Hamming distance of 4.27 bits away from its nearest
neighbor in the learnable set.
1.   0000 0000 0000 0000
2.   0001 0000 1111 1111
3.   0010 0111 0000 1111
4.   0011 0111 1111 0000
5.   0100 1011 0011 0011
6.   0101 1011 1100 1100
7.   0110 1100 0011 1100
8.   0111 1100 1100 0011
9.   1000 1101 0101 0101
10.  1001 1101 1010 1010
11.  1010 1010 0101 1010
12.  1011 1010 1010 0101
13.  1100 0110 0110 0110
14.  1101 0110 1001 1001
15.  1110 0001 0110 1001
16.  1111 0001 1001 0110


It is possible that this set is optimal and that no other set of 16 vectors exists with a lower
expected error. We have not been able to prove this but we note that the achieved expected
error is near the lower bound, 4.18, that can be computed from (5) above.
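For n=16 the average distance to the nearest learnable vector can be measured by exhaustive search over all 65,536 targets. The sketch below is a plain brute-force check with an illustrative function name; it is run here on the truncation set for q=4, whose value of 5.0 is quoted in the previous section.

```python
def mean_nearest_distance(learnable: list[str]) -> float:
    """Average Hamming distance from every n-bit vector to its nearest learnable vector."""
    n = len(learnable[0])
    codes = [int(v, 2) for v in learnable]
    total = 0
    for target in range(2 ** n):
        total += min(bin(target ^ c).count("1") for c in codes)
    return total / 2 ** n

# Truncation machine for n = 16, q = 4: every 4-bit segment is constant.
truncation_set = ["".join(b * 4 for b in f"{i:04b}") for i in range(16)]
print(mean_nearest_distance(truncation_set))  # 5.0
```

Passing the sixteen vectors listed above, with the spaces removed, to the same function gives the quantity that the text reports as 4.27.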

It would be desirable to be able to construct optimum sets for any n and L. Then the
learning machine design procedure would follow these steps:

(1) Specify n and the desired maximum allowable expected error.

(2) Find the L required to reduce the expected error to the required level.

(3) Synthesize L vectors which have minimum expected error and the machine to meet the
specifications.

Unfortunately, there is no known method for finding such optimum sets except through
completely enumerative methods. This section gives a method for constructing near optimal sets.
The resulting machine will be called the G1 machine (named for the second author of this
paper).

The construction method of the G1 machine will be first explained graphically and then in
terms of binary vectors. The method assumes that $L = c\,2^h$ learnable vectors are to be spread
across the space of all behaviors where c and h are positive integers. The total space is broken
into $2^h$ subspaces each with c learnable vectors. The construction method begins by placing c
vectors optimally in one subspace as shown in Figure 13(a) (where c=2). These optimal
positions are found by enumerative or other means. Then the configuration is doubled as shown in
(b). However, (b) can be improved by trying all possible rotations of the new space with respect
to the old space to find the rotation which yields minimum expected error. Figure 13(c) shows
such a rotation. The improvement comes from the fact that some vectors (those near B) are
nearer a learnable vector after the rotation than before. The rightmost learnable vector in the
original subspace is now nearer those points in B than any learnable vector in B's own subspace.

The construction continues as shown in Figure 13(d) and (e). The space is doubled again
and rotated again for further reduction in expected error. This process continues until the
complete behavior space has been accounted for. The complete synthesis procedure is near optimal
because all the $2^h$ subspaces are locally optimal and they are rotated optimally with respect to
each other. However, they are not globally optimal.
(a) A 'basic space' with two optimally placed learnable vectors.
(b) Doubling the basic space.
(c) Rotating the new space to reduce expected error.
(d) Doubling again.
(e) Rotating again to further reduce expected error.

Figure 13. Constructing the G1 machine.

The G1 algorithm operates in the fashion illustrated in Figure 13 except that it creates a
set of binary vectors which are designed to cover the behavior space nearly optimally. (The
coding theorists [11] use a similar approach in their search for good codes.) Suppose, for
example, that the initial space is the set of binary vectors of length three: 000, 001, ..., 111. Let us
choose a single vector that will cover this space optimally. Any vector will do, so let it be 000.

000 

Then this set of learnable vectors is doubled by writing down two copies, one copy with a new
zero added to the left end and one copy with a one added to the left end.

0000
1000

Next the new copy is rotated with respect to the old one as was done in Figure 13(c). That is,
some of the columns in the new copy are inverted to reduce the average error of vectors of
length four to their closest learnable vector in the set. There are eight possible inversions
yielding the following possible learnable sets:

0000    (no inversion)
1000

0000    (invert rightmost column
1001     in new space)

0000    (invert second from rightmost
1010     column in new space)

0000    (invert all columns in new space)
1111

Of the eight choices, an inversion is selected which achieves minimum expected error for the
space of vectors of length 4. In this case, the vectors

0000
1011

were selected. This is a G1 machine for the space n=4.

For the space n=5, the next G1 machine is constructed by copying the above learnable set
and again inverting columns in the second copy. The copy operation yields

00000
01011
10000
11011

and any of the three rightmost columns of the lower two vectors may be inverted. All such
inversions are attempted and in each case, the average distance for a randomly selected vector
of length five from one of these learnable vectors is computed. The rotation (or inversion) that
yields the least average distance (or error) is selected. In this case, the rightmost and third from
rightmost columns were inverted in the last two rows.

00000
01011
10101
11110

These four vectors are the learnable set for a G1 machine on the space n=5.


Repeating the construction for a third time yields a G1 machine for the case n=6.
Copying yields eight vectors.

000000
001011
010101
011110
100000
101011
110101
111110


The  best  rotation  of  the  rightmost  three  columns  on  the  lower  half  of  the  vectors  involves 
inverting  the  second  and  third  bits  from  the  right  side. 
000000
001011
010101
011110
100110
101101
110011
111000


Clearly the construction can be repeated to obtain G1 machines for any n-space. In this
example, L is one eighth of the set of all vectors ($L = 2^n/8$) because the initial space had one
learnable vector covering eight. In general, the ratio of L to $2^n$ can be any power of two
depending on the selection of the initial space.
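The doubling and inversion steps can be sketched as follows. This is an illustration rather than the authors' program: the function names are made up here, vectors are stored as integers, only the rightmost `free_cols` columns of the new copy are inverted (three in the worked example), and ties between equally good inversions are broken by taking the first one found, so later stages may pick a different but equally good set from the one printed in the text.

```python
def mean_error(codes: list[int], n: int) -> float:
    """Mean Hamming distance from each n-bit vector to the nearest code."""
    return sum(
        min(bin(v ^ c).count("1") for c in codes) for v in range(2 ** n)
    ) / 2 ** n

def double_and_rotate(codes: list[int], n: int, free_cols: int) -> tuple[list[int], float]:
    """Extend n-bit codes to n+1 bits, inverting some rightmost columns of the new copy."""
    best_codes, best_err = None, None
    for mask in range(2 ** free_cols):
        # Old copy gets a leading 0; new copy gets a leading 1 and the inversion mask.
        candidate = codes + [(1 << n) | (c ^ mask) for c in codes]
        err = mean_error(candidate, n + 1)
        if best_err is None or err < best_err:
            best_codes, best_err = candidate, err
    return best_codes, best_err

codes, n = [0b000], 3
for _ in range(3):
    codes, err = double_and_rotate(codes, n, free_cols=3)
    n += 1
    print(n, [f"{c:0{n}b}" for c in codes], err)
```

The search is exhaustive over inversions and over all target vectors, so it is only practical for small n.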

Figure 14 gives a plot of the expected error for many G1 machines in comparison with the
known lower bound (5). The figure shows that these machines are either optimum or near
optimum in all of the cases examined.


Figure 14. Expected error of G1 machines compared to the lower bound.
The G1 machine does not have practical significance since its only definition is that it is a
set of learnable vectors. There is no known realization for such machines. However,
theoreticians interested in minimizing expected error are bound to wonder what a set of optimally
distributed vectors looks like, and the G1 construction provides a way to generate near optimal sets
if p is not too large.

The  Random  Vectors  Machine 

Another interesting way to construct a learning machine is to choose its learnable
behaviors randomly. While this may not lead to practical devices, it provides a useful
comparison for other machines. Specifically, what will be called the random vectors machine is a
surprisingly good approximation to an optimum machine and one can easily compute its
expected error. Thus the random vectors machine provides an easy way to compute an upper
bound for the expected error of an optimum machine.

Assume then, that L vectors are selected independently and randomly from an equiprobable
space of binary vectors of length n. These are to be the learnable vectors for the random
vectors machine and learning will occur by having the machine choose which of its learnable
vectors is closest to the target behavior.

Let $P_L(d)$ represent the probability that the nearest random vectors machine learnable
vector to a randomly selected vector will be exactly d away. It can be shown that

$$P_L(d) = \left[\sum_{i=0}^{n-d} \binom{n}{i}\frac{1}{2^n}\right]^{L} - \left[\sum_{i=0}^{n-d-1} \binom{n}{i}\frac{1}{2^n}\right]^{L}$$

The expected error for the random vectors machine is $\sum_{i=0}^{n} i\,P_L(i)$ where $P_L$ is evaluated as
given. As mentioned above, this formula gives an upper bound to and estimates well the
expected error for the optimum machine. Some of its values are plotted in the chart given in
the following section.
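The distribution of the nearest distance follows from independence: the nearest of L random vectors is at distance at least d exactly when all L of them are. The sketch below computes it that way and sums to get the expected error; the function names are illustrative, and the vectors are assumed to be drawn independently with replacement, as in the text.

```python
from math import comb

def nearest_distance_distribution(n: int, L: int) -> list[float]:
    """P(nearest of L random n-bit vectors is at distance exactly d), for d = 0..n."""
    # tail[d] = probability that a single random vector is at distance >= d.
    tail = [sum(comb(n, i) for i in range(d, n + 1)) / 2 ** n for d in range(n + 2)]
    return [tail[d] ** L - tail[d + 1] ** L for d in range(n + 1)]

def random_machine_expected_error(n: int, L: int) -> float:
    """Expected distance from a random target to the nearest of L random vectors."""
    return sum(d * p for d, p in enumerate(nearest_distance_distribution(n, L)))

for L in (1, 16, 256):
    print(L, round(random_machine_expected_error(16, L), 3))
```

With L=1 the value reduces to n/2, the expected distance between two random vectors, which is a convenient sanity check.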

Comparing  Some  Expected  Errors 

Figure 15 gives the expected error for the various machines discussed in this paper for the
configuration p=4: signature table, threshold, k-CNF, truncation, G1, random vectors, and
lower bound. As mentioned above, the G1 and the random vectors machines fairly well
approximate the behavior of optimum machines. The G1 machines give a construction procedure for
the learnable classes and the random vectors study provides an approximate calculation for the
achievable minimum expected error. However, neither of these classes has a straightforward
realization appropriate for applications.

The more traditional machines based on signature tables and threshold functions compare
surprisingly well with optimum behavior, often yielding error rates on the order of 20 to 30
percent above optimum in the ranges investigated. This indicates the learnable classes for these
machines are fairly well distributed. Furthermore, no evidence came from this study to make
one prefer one of these systems over any other on the grounds of expected error. The choice of
which system to use should probably be dominated by the quality of the learning algorithms or
the specific characterizations or properties of the learnable classes.

The worst behavior observed here is for the naive truncation machine, which was
originally suggested as possibly optimal.

The lower bound curve has a scallop shape with one lobe for each value of D. The
rightmost lobe corresponds to the case D=1 where every vector in the space is either learnable or
one away from learnable. The second lobe from the right graphs the case D=2 and so forth. The
intersection points between the lobes are the places where (1) holds as an equality; the Sq set
around each Vq has filled a sphere of size D and a larger Sq set would move one into the case
D+1. These are the places where the coding theorists look for perfect codes.

Figure 15. Expected error versus log L for the machines examined in this paper (n = 16).
Curves shown: lower bound, random, G1, signature table, truncation, threshold, k-CNF.
Conclusions

The main result of this paper is to emphasize the fundamental difficulty of learning. If
the learning is to be done on the basis of a reasonable amount of information, then the learning
machine will be realization sparse and learned behaviors will typically be only small
improvements over a random decision maker. If the target behavior is to be learned accurately, the
learning requires the observation of a very large fraction of the set of all possible behaviors.
The machine acts more or less as if it were doing rote memorization. Relationship (4) gives a
lower bound on the number $\log_2(L_p)$ of observations required to achieve accuracy X.

Many well known learning machines are realization sparse as is the case for the signature
tables, linear models, and conjunctive normal form machines examined here. These machines
are popular because of their fast and effective learning algorithms but the number of behaviors
that they can learn is relatively small. Thus, it was possible for Minsky and Papert [8] to prove
a large number of noncomputability results for the perceptron. Presumably similar results could
be proven for any of the other realization sparse classes. The only way such machines can
function very effectively is if the class of target behaviors corresponds closely to the class of
learnable behaviors.

While the realization sparse machines can learn few functions precisely, they can converge
toward many functions at least to some degree. Thus Samuel [13,14], Sejnowski [15], and others
have observed improvements in system behaviors using these models. However, the results here
make one suspect that there probably were vastly more accurate behaviors in their respective
function spaces, but they did not have the means to converge upon them.

Human  learning  probably  does  not  include  the  capability  to  acquire  general  functions 
unless  p  is  effectively  very  small,  say  2  or  3.  Human  learning  probably  depends  heavily  on  rote 
memorization  and  the  ability  to  recognize  perturbations  on  the  original  patterns. 
An interesting dimension for studying classes of learnable behaviors relates to their
distribution across the total space of behaviors. It was determined here that for small p, the specific
models studied (signature tables, linear models, conjunctive normal form systems) deviate
substantially from uniform distributions. This opens the question as to whether machines with
lower expected errors might be found, and the G1 machine exhibits the type of learnable
behaviors that would achieve this goal.

The implications of this work for the many studies now ongoing in machine learning (see,
for example [2,6,7]) are a matter for additional research. The emphasis here is that there is a
necessary tradeoff between rate of learning and achievable accuracy for any learning machine.
In some cases, it is possible to measure these parameters accurately and to understand where a
particular machine lies on the spectrum. Until these issues are more fully understood, the
design of learning systems will remain a “black art”.




Appendix  A 


Counting  Signature  Tables 

This paper and earlier papers have shown the type of functions that can be
represented (or learned) by signature tables. It is clear that many functions cannot be
represented if the sizes of internal alphabets are restricted and the purpose of this section is to
evaluate how serious the limitations are. The methodology given here is an improved and
simplified version of that given in Biermann et al. [1]. If one can compute the size L of the
learnable class, it becomes possible to estimate the probability that a randomly selected
behavior can be learned. One can also use L in formula (1) to obtain a lower bound on the
worst case error that could possibly be encountered.

One can compute the number L of functions computable by the table system of Figure 11
assuming $v_i \le 2$ for i=1,2,...,6, and in the process derive a general methodology for handling
any such table system. It turns out that one cannot simply compute the number of settings of
all of the 28 binary constants in the table, a total of $2^{28}$, because a single function may be
realizable by many different settings. So one must count only nonredundant settings.

Consider, as a start, the portion of the tree labelled $f_3$. The output alphabet size $v_3$
could equal one in which case the output column would be either (0,0,0,0) or (1,1,1,1). But the
class of functions computable by the whole signature table system is the same regardless of
which is selected. That is, the output leaving $f_3$ is simply an internal code for the system and
0 and 1 are abstract symbols. Adjustments can be made to the top table of $f_1$ to obtain the
same system behavior regardless of whether the $f_3$ output vector is (0,0,0,0) or (1,1,1,1). Thus
one can conclude that the number of nonredundant configurations of $f_3$ with output alphabet
of size one is 1 and this is written as

$$N(f_3,1) = 1.$$


If $v_3 = 2$ then fourteen vectors are possible, (0,0,0,1), ..., (1,1,1,0). Again, as explained in
[1], since 0 and 1 are abstract symbols with only internal meaning, the vectors (0,0,0,1) and
(1,1,1,0) result in identical classes of behaviors so only one of them is counted. In fact, only half
of the fourteen vectors are nonredundant: (0,0,0,1) (0,0,1,0) (0,0,1,1) (0,1,0,0) (0,1,0,1) (0,1,1,0)
(0,1,1,1). Thus $N(f_3,2) = 7$. In general, the number of configurations of t with q
binary inputs and output vocabulary of size v when t is a single bottom level table is

$$N(t,v) = \begin{cases} 0 & \text{if } 2^q < v \\ \dfrac{1}{v!}\displaystyle\sum_{i=0}^{v} (-1)^i \binom{v}{i} (v-i)^{2^q} & \text{otherwise} \end{cases} \qquad (3)$$

as derived in [1].

Consider next the subtree $f_1$ in Figure 11. First the number $C(f_1,v,z)$ of nonredundant
configurations of the top table in $f_1$ will be computed as a function of v, the output vocabulary
size, and z, a vector giving the sizes of all the input alphabets to the top table of $f_1$. Then the
number $N(f_1,v)$ of nonredundant configurations for the whole $f_1$ tree will be determined. Let
b(t) stand for the branching factor below the top table in tree t. Then z will be a vector of
length b(t). $b(f_1) = 2$ in Figure 11.

Assuming that the highest table in $f_1$ is not the top table or a bottom table in the
complete table system and that this highest table has input and output vocabularies of size two, it
is shown in [1] that

$$C(f_1,2,z) = \begin{cases} 0 & \text{if } q = 0 \\ \dfrac{1}{2}\displaystyle\sum_{i=0}^{q} (-1)^i \binom{q}{i}\, 2^{2^{(q-i)}} & \text{otherwise} \end{cases}$$

where q is the number of binary inputs. In this case, q = 2 and

$$C(f_1,2,(2,2)) = 5.$$

The five nonredundant vectors are (0,0,0,1), (0,0,1,0), (0,1,0,0), (0,1,1,0), (0,1,1,1). (The vector
(0,1,0,1) is redundant because it yields the same value regardless of the output of $f_3$. (0,0,1,1)
is redundant because it yields the same value regardless of the value of $f_4$.)

The number of nonredundant subtrees $f_1$ with output alphabet of size v and input
alphabets from subtrees $f_3$ and $f_4$ of sizes $v_3$ and $v_4$ is the product of the number of nonredundant
top tables times the product of the numbers of nonredundant subtrees.

$$C(f_1,v,(v_3,v_4)) \cdot N(f_3,v_3) \cdot N(f_4,v_4)$$

But the inputs to the top table may have alphabet sizes of 1 or 2, so $N(f_1,v)$ is the sum over
all such values.

$$N(f_1,v) = \sum_{\substack{\text{all } (v_3,v_4) \\ 1 \le v_3 \le 2 \\ 1 \le v_4 \le 2}} C(f_1,v,(v_3,v_4))\, N(f_3,v_3)\, N(f_4,v_4)$$

More  generally  a  tree  t  may  have  a  branching  factor  of  b(t)  below  the  top  table  and  input 
alphabets  to  the  top  table  from  1  to  .  The  counting  formula  is 

$$N(t,v) = \sum_{\substack{\text{all } z \\ \text{of length } b(t)}} C(t,v,z) \prod_{h=1}^{b(t)} N(\mathrm{subtree}(h,t),\, \mathrm{entry}(h,z)) \qquad (4)$$

where subtree$(h,t)$ returns the $h$-th child below the highest table in $t$ and entry$(h,z)$ selects the
$h$-th entry in $z$.

Equation (4) counts the number of distinct functions computable by any subtree $t$ in a signature
table system except for (a) the case where $t$ is a bottom subtree which includes only one
table (and (3) is used) and (b) the case where $t$ is the whole signature table system (and a
special form for $C(t,v,z)$ is used). Applying (4) to $f_1$ yields

$$\begin{aligned}
N(f_1,2) &= C(t,2,(2,2))\,N(f_3,2)\,N(f_4,2) \\
&\quad + C(t,2,(1,2))\,N(f_3,1)\,N(f_4,2) \\
&\quad + C(t,2,(2,1))\,N(f_3,2)\,N(f_4,1) \\
&\quad + C(t,2,(1,1))\,N(f_3,1)\,N(f_4,1) \\
&= 5\cdot 7\cdot 7 + 1\cdot 1\cdot 7 + 1\cdot 7\cdot 1 + 0\cdot 1\cdot 1 = 259
\end{aligned}$$

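In the case of output vocabulary of size one, (4) yields

$$\begin{aligned}
N(f_1,1) &= C(t,1,(2,2))\,N(f_3,2)\,N(f_4,2) \\
&\quad + C(t,1,(1,2))\,N(f_3,1)\,N(f_4,2) \\
&\quad + C(t,1,(2,1))\,N(f_3,2)\,N(f_4,1) \\
&\quad + C(t,1,(1,1))\,N(f_3,1)\,N(f_4,1) \\
&= 0\cdot 7\cdot 7 + 0\cdot 1\cdot 7 + 0\cdot 7\cdot 1 + 1\cdot 1\cdot 1 = 1
\end{aligned}$$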
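The two sums for $f_1$ can be reproduced mechanically. The Python sketch below is a check on the arithmetic, not the authors' procedure: the function names are hypothetical, the bottom-table counts (7 and 1) are taken from the text, and the configuration counts for the intermediate table are generated by an inclusion-exclusion count of functions that depend on every binary input, halved for output vocabulary size two. That halved form is an assumed reading of the result cited from [1]; it is used here because it reproduces the values 5, 1, 1, 0 that appear in the worked sums.

```python
from itertools import product
from math import comb

def dependent_functions(q):
    """Number of Boolean functions of q binary inputs that depend on all q inputs."""
    return sum((-1) ** i * comb(q, i) * 2 ** (2 ** (q - i)) for i in range(q + 1))

def c_intermediate(v, z):
    """Assumed configuration count for an intermediate table (not top, not bottom)."""
    q = sum(1 for size in z if size == 2)      # number of binary inputs
    if v == 1:
        return 1 if q == 0 else 0              # values used in the text for v = 1
    return 0 if q == 0 else dependent_functions(q) // 2

N_BOTTOM = {1: 1, 2: 7}                        # N(f_3, z) = N(f_4, z), from the text

def n_intermediate(v):
    """Sum over all (z_3, z_4) with entries 1 or 2."""
    return sum(
        c_intermediate(v, (z3, z4)) * N_BOTTOM[z3] * N_BOTTOM[z4]
        for z3, z4 in product((1, 2), repeat=2)
    )

print(n_intermediate(2), n_intermediate(1))
```

The dominant contribution for vocabulary size two is the single term with both inputs binary, 5 x 7 x 7 = 245; the remaining terms add only 14.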

The top table in Figure 11 may have, according to [1],

$$C(f,v,z) = \sum_{i=0}^{q} (-1)^i \binom{q}{i}\, 2^{2^{q-i}}$$

where $q$ is the number of binary inputs. In this example, $v=2$ and $q=2$, yielding

$$C(f,2,(2,2)) = 10.$$

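So the total number of functions representable by $f$ with an output vocabulary of size two is

$$\begin{aligned}
N(f,2) &= C(f,2,(2,2))\,N(f_1,2)\,N(f_2,2) \\
&\quad + C(f,2,(2,1))\,N(f_1,2)\,N(f_2,1) \\
&\quad + C(f,2,(1,2))\,N(f_1,1)\,N(f_2,2) \\
&\quad + C(f,2,(1,1))\,N(f_1,1)\,N(f_2,1) \\
&= 10\cdot 259\cdot 259 + 2\cdot 259\cdot 1 + 2\cdot 1\cdot 259 + 2\cdot 1\cdot 1 \\
&= 671{,}848
\end{aligned}$$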

For output vocabulary size of one,

$$N(f,1) = 2$$

This means that the total number of functions representable by the system of Figure 11
under the specified vocabulary limitations is $L = N(f,1)+N(f,2) = 671{,}850$. Comparing $L$
with the total number of functions on eight variables, $2^{2^8} = 2^{256} = 1.16 \times 10^{77}$, one can see that
only a very small fraction of the set of all functions can be learned. Furthermore, using (1) it is
possible to find a lower bound on the worst error $D$ that could occur in trying to learn a target
behavior. It yields $D > 91$, which means that there exist target functions for which the best
learning algorithm may produce errors on 91 out of 256 possible inputs. Considering that a random
decision maker would be wrong only 128 out of 256 times, one can see that the system is
capable of very poor performance.
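To make the size of this fraction concrete, the short Python sketch below recomputes $L$ from the numbers quoted in the text and divides by $2^{256}$. The dictionary names are hypothetical; the entries (259 and 1 for the two subtrees, and 10, 2, 2, 2 for the top table) are the values used in the worked sum for $N(f,2)$, and $N(f,1) = 2$ is taken as stated.

```python
from fractions import Fraction

# N(f_1, v) = N(f_2, v): subtree counts quoted in the text.
N_SUBTREE = {1: 1, 2: 259}

# C(f, 2, z) for the top table, as used in the worked sum.
C_TOP = {(2, 2): 10, (2, 1): 2, (1, 2): 2, (1, 1): 2}

n_top_2 = sum(c * N_SUBTREE[z1] * N_SUBTREE[z2] for (z1, z2), c in C_TOP.items())
L = 2 + n_top_2                      # N(f,1) + N(f,2), with N(f,1) = 2

total = 2 ** (2 ** 8)                # all Boolean functions on eight variables
fraction = Fraction(L, total)        # exact ratio of learnable to possible functions

print(L)
print(f"{float(total):.2e}")
print(float(fraction))
```

The exact ratio is kept as a `Fraction` so that no precision is lost before the final conversion to a float for display.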

Severe as these effects may be, they are exponentially worse in deeper trees. This can be
seen by bounding the counting functions given above and deriving a general bound for the number
of functions versus the depth of the tree. Define the depth $d$ to be the number of levels of
tables in binary table systems of the form of Figure 11. (Thus $d=3$ for Figure 11.) Consider
such tables of any depth $d > 1$ with all vocabulary sizes limited to two or less.

For bottom level tables $f_{bottom}$, as above, $N(f_{bottom},2)=7$. For intermediate level tables
$f_{level}$,

$$N(f_{level},2) = 5\,[N(f_{level-1},2)]^2 + \text{small terms} < 6\,[N(f_{level-1},2)]^2$$

Similarly, for the top level function

$$N(f_{top},2) < 11\,[N(f_{level},2)]^2.$$

Applying these at several levels yields

$$\begin{aligned}
d=2: &\quad L < 11(7^2) \\
d=3: &\quad L < 11(6(7)^2)^2 \\
d=4: &\quad L < 11(6(6(7)^2)^2)^2 \\
d=5: &\quad L < 11(6(6(6(7)^2)^2)^2)^2
\end{aligned}$$


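In general,

$$L < 11\left(6^{\sum_{i=1}^{d-2} 2^i}\right)\left(7^{2^{d-1}}\right)$$

One can replace $\sum_{i=1}^{d-2} 2^i$ with $2^{d-1}-2$ to obtain

$$L < 11\,(6^{-2})\left(6^{2^{d-1}}\, 7^{2^{d-1}}\right) = \frac{11}{36}\, 42^{2^{d-1}}$$

But the total number of functions for such a table system of depth $d$ is $2^{2^{2^d}}$. The ratio of
realizable functions to actual functions is thus less than

$$\frac{11\left(42^{2^{d-1}}\right)}{36\left(2^{2^{2^d}}\right)}$$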


This  reduces  to 


If  d  is  greater  than  2,  the  ratio  is  less  than 


3(22'*''’’) 


So the ratio of the number of realizable functions to actual functions decreases dramatically as $d$
increases.
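The nested bounds for $d = 2$ through $5$ can be compared against exact counts with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the exact recurrences are derived here from equation (4) with the configuration counts used in the worked example (5, 1, 1, 0 for intermediate tables and 10, 2, 2, 2 for the top table, with every size-one count equal to 1 below the top and $N(f,1) = 2$ at the top). The bounds use the factors 6 and 11 from the text.

```python
from math import log2

def exact_L(d):
    """Exact L for a binary table system with d >= 2 levels (assumed recurrences)."""
    n = 7                                   # bottom tables: N(f_bottom, 2) = 7
    for _ in range(d - 2):                  # intermediate levels
        n = 5 * n * n + 2 * n               # 5 N^2 plus the "small terms"
    return 10 * n * n + 4 * n + 2 + 2       # N(f,2) + N(f,1) at the top

def bound_L(d):
    """Nested bound: factor 6 per intermediate level, factor 11 at the top."""
    b = 7
    for _ in range(d - 2):
        b = 6 * b * b
    return 11 * b * b

def closed_bound(d):
    """Closed form of the nested bound, 11 * 6^(2^(d-1) - 2) * 7^(2^(d-1))."""
    return 11 * 6 ** (2 ** (d - 1) - 2) * 7 ** (2 ** (d - 1))

for d in range(2, 6):
    L, B = exact_L(d), bound_L(d)
    # log2 of the ratio L / 2^(2^(2^d)); the number of inputs is 2^d.
    log_ratio = log2(L) - 2 ** (2 ** d)
    print(d, L < B, B == closed_bound(d), round(log_ratio, 1))
```

Working with the base-2 logarithm of the ratio avoids forming $2^{2^{2^d}}$ explicitly, which already has $2^{32}$ bits at $d = 5$.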
Appendix B


The  Average  Error  for  Truncation  Machines. 

The average error for truncation machines can be computed by the formula

$$\text{average error} = \frac{\displaystyle\sum_{\text{all vectors}} (\text{size of error})}{\text{number of vectors}}$$

Consider first the problem of computing average error for a single segment of $q$ identical bits.
Then the total average error will be obtained by multiplying by the number of segments, $n/q$.


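$$\begin{aligned}
\text{average error per segment}
&= \frac{\displaystyle\sum_{i=0}^{q} (\text{number of segments with } i \text{ 1's}) \times (\text{error if } i \text{ 1's})}{2^q} \\[6pt]
&= \frac{1}{2^q}\Bigg[\sum_{i=0}^{q/2-1} (\text{number of segments with } i \text{ 1's}) \times i \\
&\qquad\quad + \left(\text{number of segments with } \tfrac{q}{2} \text{ 1's}\right) \times \frac{q}{2} \\
&\qquad\quad + \sum_{i=q/2+1}^{q} (\text{number of segments with } i \text{ 1's}) \times (q-i)\Bigg] \\[6pt]
&= \frac{1}{2^q}\left[\sum_{i=0}^{q/2-1} \binom{q}{i}\, i + \binom{q}{q/2}\frac{q}{2} + \sum_{i=q/2+1}^{q} \binom{q}{i}\,(q-i)\right] \\[6pt]
&= \frac{1}{2^q}\left[\sum_{i=0}^{q/2-1} \binom{q}{i}\, i + \binom{q}{q/2}\frac{q}{2} + \sum_{i=0}^{q/2-1} \binom{q}{i}\, i\right] \\[6pt]
&= \frac{1}{2^{q-1}}\left[\sum_{i=0}^{q/2-1} \binom{q}{i}\, i + \binom{q}{q/2}\frac{q}{4}\right]
\end{aligned}$$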
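The per-segment average can be checked numerically. The Python sketch below is illustrative and rests on one reading of the derivation above: a segment of $q$ bits containing $i$ ones is approximated by whichever of the all-zeros or all-ones segments is closer, so its error is the smaller of $i$ and $q-i$, and all $2^q$ segments are taken as equally likely. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def segment_error_bruteforce(q):
    """Average error over all 2^q segments, by direct enumeration."""
    total = 0
    for bits in product((0, 1), repeat=q):
        ones = sum(bits)
        total += min(ones, q - ones)        # distance to nearest constant segment
    return Fraction(total, 2 ** q)

def segment_error_sum(q):
    """Same average, grouping segments by their number of ones."""
    weighted = sum(comb(q, i) * min(i, q - i) for i in range(q + 1))
    return Fraction(weighted, 2 ** q)

def average_error(n, q):
    """Total average error for n bits split into n/q segments of length q."""
    if n % q != 0:
        raise ValueError("n must be a multiple of q")
    return (n // q) * segment_error_sum(q)

for q in (2, 4, 6):
    print(q, segment_error_bruteforce(q), segment_error_sum(q))
print(average_error(8, 4))
```

For $q = 4$ the weighted sum is $4\cdot 1 + 6\cdot 2 + 4\cdot 1 = 20$, giving an average of $20/16 = 5/4$ errors per segment.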
(But for q even, the identity


This  is  multiplied  by  the  number  of  segments  n/q  to  obtain 